A Spoken Document Retrieval System for TV Broadcast News in Spanish and Basque

نویسندگان

  • Amparo Varona
  • Silvia Nieto
  • Luis Javier Rodríguez-Fuentes
  • Mikel Peñagarikano
  • Germán Bordel
  • Mireia Díez
چکیده

This paper presents a spoken document retrieval system (Hearch) looking like a conventional search tool, which retrieves audio/video segments based on the automatic transcription of speech contents. The system consists of a backend that captures, processes and indexes audio/video resources, and a front-end that allows to search contents, configure various modules and display performance statistics through a web interface. An early version of this tool is available (http://gtts.ehu.es/Hearch/ ), which searches and retrieves segments on TV broadcast news repositories in Spanish and Basque. To evaluate the performance of the system, six manually transcribed TV broadcast news in Spanish and seven in Basque have been used. An approach based on extending the query with the so called friendly terms has been proposed and evaluated, attempting to minimize the effect of errors introduced by the Automatic Speech Recognition module. This approach led to slight performance improvements.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

The need to create a media block for the convergence of overseas news networks

As a general diplomacy arm of the Islamic Republic of Iran, VoSiMa has extensive activities in international broadcasting of its radio and television programs. These programs are broadcast in different languages, such as English, French, Azeri, Arabic, and ... for regional and transnational audiences. The large volume of the organization's international activities is in the form of news and new...

متن کامل

Search and access to information contained in the speech of multimedia resources

The main goal of this project is to make scientific contributions and technological improvements related to the spoken document retrieval system (Hearch) developed by the Working Group on Software Technologies of the University of the Basque Country. Hearch looks like a conventional search tool (such as Google, Bing, etc) but it is designed to retrieve audio/video segments based on the automati...

متن کامل

Recognition, indexing and retrieval of british broadcast news with the THISL system

This paper described the THISL spoken document retrieval system for British and North American Broadcast News. The system is based on the ABBOT large vocabulary speech recognizer and a probabilistic text retrieval system. We discuss the development of a realtime British English Broadcast News system, and its integration into a spoken document retrieval system. Detailed evaluation is performed u...

متن کامل

Development of Resources for a Bilingual Automatic Index System of Broadcast News in Basque and Spanish

The development of an automatic index system of broadcast news requires appropriate Video and Language Resources (LR) to design all the components of the system. Nowadays, large and well-defined resources can be found in most widely used languages (Informedia), but there is a lot of work to do with respect to minority languages. The main goal of this work is the design of resources in Basque an...

متن کامل

The Cambridge University spoken document retrieval system

This paper describes the spoken document retrieval system that we have been developing and assesses its performance using automatic transcriptions of about 50 hours of broadcast news data. The recognition engine is based on the HTK broadcast news transcription system and the retrieval engine is based on the techniques developed at City University. The retrieval performance over a wide range of ...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:
  • Procesamiento del Lenguaje Natural

دوره 47  شماره 

صفحات  -

تاریخ انتشار 2011